-
more conservative v0.1.0 annotation
aquacul4
running

cg010_blastn









Also will attempt to clean up first blast output (~58k hits)

> removing alnlength <100
> removing e-value < E-10
> sorted by query, then e-value; then removed duplicates on query and query start column (reduced to 5786)
> repeated with query end column (reduced to 3891)
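
Rough sketch of the cleanup steps above in Python, in case they need repeating outside a spreadsheet. Assumptions: the blast output is the standard 12-column tabular format (query, subject, %ID, alignment length, mismatches, gap opens, qstart, qend, sstart, send, e-value, bit score), and the e-value step is meant to keep only hits at or below 1E-10 (the wording "removing e-value < E-10" is ambiguous, so adjust if the opposite was intended). The function name and defaults are made up for illustration.

```python
# 0-based column positions in 12-column BLAST tabular output (assumed layout).
QUERY, LENGTH, QSTART, QEND, EVALUE = 0, 3, 6, 7, 10


def clean_hits(rows, min_length=100, max_evalue=1e-10):
    """Filter hits, sort by query then e-value, and drop duplicates.

    rows is a list of lists of strings, one inner list per BLAST hit.
    Duplicates are removed first on (query, qstart), then on (query, qend),
    keeping the first (lowest e-value) hit in each group.
    """
    # Keep alignments of at least min_length with e-value <= max_evalue.
    kept = [
        r for r in rows
        if int(r[LENGTH]) >= min_length and float(r[EVALUE]) <= max_evalue
    ]
    # Sort by query ID, then by e-value (best hit first within a query).
    kept.sort(key=lambda r: (r[QUERY], float(r[EVALUE])))

    # Two de-duplication passes: query start column, then query end column.
    for col in (QSTART, QEND):
        seen = set()
        deduped = []
        for r in kept:
            key = (r[QUERY], r[col])
            if key not in seen:
                seen.add(key)
                deduped.append(r)
        kept = deduped
    return kept
```

Rows can be read with `csv.reader(handle, delimiter="\t")`. Doing the sort before the duplicate removal is what makes the surviving hit the one with the lowest e-value.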


Then going with annotation on Galaxy
Galaxy52-[Join_two_Datasets_on_data_32_and_data_50].tabular

new GFF
Annotation1_cg_v010.gff

Blast2Gff code looks something like this:
./Blast2Gff.pl -i /Volumes/Bay3\ scratch/gff_fun/7 -o /Volumes/Bay3\ scratch/gff_fun/Combined_fosmids_cd_hit_mod_20000_7trim.gff -d "sigenae_v8" -p EXON -s "something"




Flipping it around: taking sigenae v8 and blasting against tgagag v0.1.0 on Server (SW) using -G 1 and -E 1

Will modify the Blast2Gff script to try to pull out relevant information.

BLAST COMPLETE
output: http://aquacul4.fish.washington.edu/~steven/filefish/sigena8_blast_v010.txt

7.4 Million lines

Will use Galaxy to filter…




align length >100

;; down to 43,061 lines..

from original
evalue < 0.01

;; down to 110,000 lines


from original
evalue < 0.0001



;; down to 65,842 lines

from there going to trim to > 100 algnlength

;; now at 37,274 lines
Galaxy57-[Filter_on_data_56].tabular
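
For reference, the Galaxy filter steps could be reproduced locally with something like the sketch below. It assumes the standard 12-column tabular blast output, so %ID is column 3, alignment length is column 4 and e-value is column 11 (c3, c4, c11 in Galaxy terms). The function and parameter names are hypothetical. Lines are streamed one at a time, which matters for a 7.4 million line file.

```python
def filter_hits(lines, max_evalue=None, min_length=None, min_pident=None):
    """Yield BLAST tabular lines that pass every threshold that is set.

    Comparisons mirror the expressions in the notes:
    e-value < max_evalue, alignment length > min_length, %ID >= min_pident.
    A threshold left as None is not applied.
    """
    for line in lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        pident = float(fields[2])   # c3
        length = int(fields[3])     # c4
        evalue = float(fields[10])  # c11
        if max_evalue is not None and not evalue < max_evalue:
            continue
        if min_length is not None and not length > min_length:
            continue
        if min_pident is not None and not pident >= min_pident:
            continue
        yield line
```

Calling `filter_hits(handle, max_evalue=0.0001, min_length=100)` on an open file handle corresponds to the e-value then alignment length trimming; adding `min_pident=95` applies a %ID cut in the same pass.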


---------
Back to original 

trimming on %ID
c3>=95

about a million lines

---
Now will filter the 37,274-line file (evalue, algnlength) with
c3>=95

1105 lines 
Galaxy60-[Filter_on_data_57].tabular


NEW GFF file
sig8_blast_v010_flp.gff

also known as Annotation2_cg_v010
NOTE: need to have col 9 indicate "name="

final-Annotation2_cg_v010.gff
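
A minimal sketch of what the col 9 requirement means when turning a tabular blast hit into a GFF line. This is not the Blast2Gff.pl code, just an illustration in Python. Assumptions: 12-column tabular input, the subject sequence is used as the GFF seqid, the query ID goes into column 9, and the bit score goes into the score column. The function name, the `source` and `feature` defaults, and the `name_key` parameter are all placeholders.

```python
def blast_hit_to_gff(fields, source="blast", feature="match", name_key="name"):
    """Convert one 12-column BLAST tabular hit into a 9-column GFF line.

    fields is a list of strings for a single hit. The subject ID becomes
    the GFF seqid and column 9 is written as name_key=<query ID>.
    """
    query, subject = fields[0], fields[1]
    sstart, send = int(fields[8]), int(fields[9])
    # GFF wants start <= end; reversed subject coordinates mean minus strand.
    start, end = min(sstart, send), max(sstart, send)
    strand = "+" if sstart <= send else "-"
    score = fields[11]  # bit score
    attributes = f"{name_key}={query}"  # column 9
    return "\t".join(
        [subject, source, feature, str(start), str(end), score, strand, ".", attributes]
    )
```

The attribute key is left as a parameter because the GFF3 specification capitalizes the reserved tag as `Name`, while the note above uses lowercase; which one is needed depends on the downstream tool.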

--
running an MBD ref map on it. 

--

IDEA: need to get known gene structure and validate an approach.